Statistical classification of activities of molecules is a
computer-implemented methodology of QSAR employing visualization
of molecular features and statistical techniques for correlating
features of molecules with their [...] described by noting the
presence (1) or absence (0) of a feature of interest. The
identification of specific features coded by 1's or 0's is
accomplished by recursive partitioning. The data sets are planned
or unplanned. The method is also applicable to classification of
individuals in biological populations on the basis of their
genetic makeup.



[Figure: flowchart. Chemical structure data (Molfile) and biological data feed bit string generation, followed by recursive partitioning feature identification and a 2-dimensional structure drawing with features.]






WO 98/47087 



PCT/US98/07899 






STATISTICAL DECONVOLVING OF MIXTURES 



BACKGROUND OF THE INVENTION 

A portion of the disclosure of this patent document contains material which is
subject to copyright protection. The copyright owner has no objection to the
facsimile reproduction by anyone of the patent document or the patent
disclosure, as it appears in the Patent and Trademark Office patent file or
records, but otherwise reserves all copyright rights whatsoever.



This invention relates generally to computer-assisted methods of analyzing
chemical or biological activity and specifically to computer-assisted methods of
determining chemical structure-activity relationships, and determining which
species in a mixture from a chemical or biological population can be predicted
to have a given biological activity or biological phenotype. This method is
particularly useful in the fields of chemistry and genetics.



Combinatorial chemistry and high-throughput screening (HTS) are having a
major impact on the way pharmaceutical companies identify new therapeutic
lead chemical compounds. Voluminous quantities of data are now being
produced routinely from the synthesis and testing of thousands of compounds
in a high-throughput biological assay. The [...] of chemical libraries
has, in effect, replaced the painstaking individual synthesis of compounds for
biological testing with a strategy for the multiple synthesis of many
compounds about a common structural core scaffold. Since there is such a
low probability of identifying new lead compounds from screening
programs, it is expected that the sheer number of compounds made via a
combinatorial approach will provide many more opportunities to find novel
leads. However, making and testing thousands of [...] developed in the last
decade for the statistical analysis of a relatively small number of
compounds (fewer than 100) are not suitable for use on much larger data
sets. Consequently, new technologies must be investigated.

Various methods for the storage and retrieval of chemical structure/biological
activity data have been devised. Software products are now available from
major vendors that address most of the logistical needs of combinatorial
chemistry. Little thought, however, has been given to how the data might best
be used to guide future synthetic efforts once the biological activity of
chemical compounds has been learned. One possible result from the synthesis
and testing of large numbers of compounds is a short list of promising new
lead compounds for further consideration. Many research programs stop here
and immediately revert to traditional synthesis in order to optimize the new
leads. On the other hand, others seeking to continue along a combinatorial
path have employed an evolutionary approach to make best use of all the data.

Genetic algorithms have also been used to select new chemical libraries to be
made. However, due to the complex and specialized nature of the software used
to identify 3D pharmacophores, it is unlikely that these methods will be able
to routinely handle the volume of data and/or possible multiple binding modes
or sites.



For a number of years, there has been an interest in using artificial
intelligence [...] hidden rules from, or otherwise classify,
chemical datasets. Most have focused on reaction prediction. Others have used
neural networks, fuzzy adaptive least squares and the like to analyze
structure-activity datasets or predict chemical properties. Most of these
methods are generally much too complex for routine structure-activity
relationship (SAR) analysis of large heterogeneous data sets.



Recursive partitioning [...] an analysis that is based on assumptions of
linearity such as multiple linear regression (or basic QSAR), principal
component regression [...] partial least squares (PLS). Various
implementations of RP exist but none have been adapted to the specific
problem of generating SAR. The present invention features a new computer
program, Statistical Classification of Molecules using recursive partitioning
(SCAM), to analyze large numbers of binary descriptors (which are concerned
only with the presence or absence of a particular feature) and to
interactively partition a data set into active classes.

SUMMARY OF THE INVENTION

In brief summary, the invention is a computer-based method of encoding
features of mixtures, whether the features be of individual data objects in a
mixture or features of mixtures themselves, and of identifying and correlating
those individual features to a response characteristic that is a trait of interest of
the individual data object or of the mixture. The method is applicable to data
objects in those types of data sets that are characterized in being a mixture of
data object classes, each data object class containing one or more of the data
objects, and wherein multiple data objects present a same trait of interest, but
classes of data objects produce the response characteristic that is a trait of
interest through different underlying mechanisms. The method comprises the
steps of: assembling a set of descriptors and converting said set of descriptors
into the form of a bit string such that each descriptor reflects the presence or
absence of a potentially useful feature in a data object of interest; examining
each data object for the presence or absence of each of said descriptors;
assembling the results of looking for descriptors into a vector for each data
object, noting the presence or absence of each feature in said data object;
assembling all vectors thus generated into a matrix; dividing the data in said
matrix into two daughter sets on the basis of presence or absence of a given
descriptor from said set of descriptors; and iteratively repeating this step
until each member of said mixture has been classified into a group.

The method is applicable to three broad situations. [...] nonetheless lead to
a phenotypically identical clinical disease diagnosis. Secondly, those
situations in which the data objects are [...], e.g. a mixture of k chemical
compounds tested together in a high throughput screen, or a mixture of
different structural modes of a compound, and those data objects that show a
given activity of interest do so in the same fashion or through the same
underlying mechanism of action. And thirdly, those situations in which the
data objects are mixtures and the active elements in the mixtures produce the
same activity, but are acting through different mechanisms, for example, where
k chemical compounds are screened together for activity and two of the
compounds bind to a biological receptor, but bind to it in different places or
in different conformations.

Each of these three types of situations can be addressed whether they are
planned or inadvertent mixtures. A planned mixture occurs where the fact of
being a mixture is capable of manual control, as is the case with carrying out
a combinatorial synthesis, or where a high throughput screening is carried out
with, for example, 20 compounds tested together. An inadvertent mixture is
said to be present whenever it is inherent in the situation, for example where
there are multiple structural conformations of a chemical compound, or where a
data set contains compounds producing the same chemical result but acting by
different mechanisms, or where a data set contains compounds producing the
same biochemical result, but binding to different receptor sites or places, or
where the data set is a human population having the same clinical disease, but
the individuals have different genetic types coding for different underlying
pathologies.




Figure 1 is a schematic illustration of the process to identify important features
of individual compounds in a mixture.

Figure 2 is a schematic illustration of the process to identify important features
of a mixture and identify active components.

Figure 3 [...] chemical structures.

Figure 4 is an illustration of a matrix having multiple vectors representing
compounds.

Figure 5 is an illustration of an analysis tree (also known as a Pachinko tree)
generated using recursive partitioning as part of the invention in order to
classify structural features of a group of chemical compounds.

Figure 6 is an illustration of an analysis tree generated using recursive
partitioning as part of the invention in order to classify genetic features of a
population.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS AND
BEST MODE OF THE INVENTION

The method of the present invention overcomes previous shortcomings in the
chemical and biological arts. In a first preferred embodiment, structure-activity
relationships (SAR's) can be developed from large bodies of data generated as
a result of high throughput screening (HTS), or combinatorial or other
automated chemical syntheses. Such chemical syntheses output data sets
composed of large numbers of structurally heterogeneous chemical compounds.

First, a set of descriptors is generated. Descriptors, as that term is used in
the present invention, are any type of descriptive notation that, in the
context of chemistry, is chemically interpretable, has enough detail that it
can capture useful chemical structural features, and is capable of being
described in terms of being present or absent in a given chemical compound,
which in turn confers the ability to describe it computationally as a bit
string. A partial, non-limiting list of descriptors can include: atom pairs,
which set forth a spatial [...] topological torsions; any binary of continuous
variables; or any combination of any of these types of descriptors. In the
context of [...], a descriptor can be, most preferably, a genetic marker, such
that an individual subject in a population of interest either does or does not
have the marker or a particular allele of a gene.

For any of the above-listed descriptors, or any non-listed descriptors that
otherwise fit the above stated criteria, any single chemical compound under
consideration either has or does not have the descriptor. This presence or
absence of a descriptor for a compound can be represented computationally as
a bit string, by a series of 1's or 0's, each representing presence or
absence, respectively, of a given descriptor for the compound under
consideration. Multiple descriptors of a given type are generated, and each
chemical compound is checked for the presence or absence of each descriptor
in the specified set of descriptors that can occur in a data set. This
comparison process yields a bit string of 1's and 0's that constitutes a
vector. The vector's sequence of 1's and 0's will be an identifier of the
compound under consideration, defining it in terms of the set of descriptors
that occur in the data set.
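The comparison step can be illustrated with a minimal Python sketch. The function name `bit_vector` and the descriptor labels are hypothetical placeholders chosen for illustration; they are not taken from the disclosure.

```python
def bit_vector(compound_features, descriptor_set):
    """Return one 0/1 flag per descriptor, in the fixed order of descriptor_set."""
    return [1 if d in compound_features else 0 for d in descriptor_set]


# Hypothetical descriptor dictionary for a data set (order defines bit positions).
descriptors = ["pair_A", "pair_B", "pair_C", "pair_D"]

# Hypothetical compound that contains two of the four descriptors.
compound = {"pair_B", "pair_D"}

vector = bit_vector(compound, descriptors)
print(vector)
```

Because the descriptor order is fixed for the whole data set, position k of every vector refers to the same descriptor, which is what allows the vectors to be stacked into a matrix.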

Two types of descriptors can be exemplified. Atom pairs and atom triples are
descriptors generated from the topological (2D) representation of a molecular
structure. They are very simple descriptors composed of atoms [...]
topological distance (i.e., the number of bonds) between them, or
equivalently, the number of atoms in the shortest path connecting the atoms.
Each local atomic environment is characterized by three values: the atomic
number, the number of non-hydrogen connections and one-half of all
associated π-electrons. For example, the carbonyl carbon in acetone is encoded
as [C, 3, 1] whilst a terminal methyl carbon would be [C, 1, 0]. The code for
the carbonyl oxygen is [O, 1, 1]. Thus, for each structure, n(n-1)/2 [...]
atom pair was then produced. In general, approximately ten thousand unique
types of atom pairs are generated for a typical data set of about one thousand
structures.
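As a rough illustration of atom pair generation, the following Python sketch enumerates all n(n-1)/2 pairs of non-hydrogen atoms in acetone and labels each pair with the two atom codes and the shortest-path bond count. The graph representation (a dictionary of atom codes plus an adjacency list) and the function names are assumptions made for this example only, and the molecule is assumed to be connected.

```python
from collections import deque
from itertools import combinations


def shortest_path_bonds(adjacency, start):
    """Breadth-first search: number of bonds from start to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adjacency[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist


def atom_pairs(atom_types, adjacency):
    """Return the set of unique (type, distance, type) atom pairs of a connected molecule."""
    dists = {a: shortest_path_bonds(adjacency, a) for a in atom_types}
    pairs = set()
    for a, b in combinations(sorted(atom_types), 2):
        # Sort the two end codes so that each pair has one canonical form.
        first, second = sorted([atom_types[a], atom_types[b]])
        pairs.add((first, dists[a][b], second))
    return pairs


# Acetone heavy atoms, coded as in the text: methyl C, carbonyl C, methyl C, carbonyl O.
atom_types = {0: ("C", 1, 0), 1: ("C", 3, 1), 2: ("C", 1, 0), 3: ("O", 1, 1)}
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

acetone_pairs = atom_pairs(atom_types, adjacency)
for pair in sorted(acetone_pairs):
    print(pair)
```

Four heavy atoms give six atom pairs, but two of them repeat an existing type, so only four unique atom pair types remain; it is the unique types that become bit positions.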

The second type of structural descriptor, atom triangles, or atom triples, has
been used by several groups for molecular similarity searching and as search
keys for 3D search and docking studies. Triangles of atoms with corresponding
interatomic distance information are thought to be the most elemental portions
of a pharmacophore. Our atom triangles differ from those previously defined.
As an indication of interatomic distance, we consider only the length of the
shortest path between each pair of atoms forming the triangle. For example, the
triangle formed amongst the carbonyl oxygen and the two terminal methyls of
acetone is [O, 1, 1] (2); [C, 1, 0] (2); and [C, 1, 0] (2). All possible triangles are
generated and each is properly canonicalized to a unique form and then
transformed into a bit string as with atom pairs. Often, depending upon the
diversity and size of the data set, it is possible to generate hundreds of
thousands to millions of unique atom triples. For a 90,000 compound data set
there are on the order of 2 million possible atom triples.

A bit string is built computationally as long as the number of distinct features,
e.g., atom triples, in an initially specified data set. The bit string is initially
populated with 0's. Any given 0 is changed to a 1 if a compound being
examined has at least one atom triple of the type assigned for that position in
the bit string. As multiple compounds are thus examined, a matrix of the type
shown in Figure 4 is created, consisting of 1's and 0's. Such a matrix can grow
to extremely large size, with over 2,000,000 descriptors not being uncommon.
However, since most of the positions will be 0, denoting the absence of a
descriptor for that compound, the matrix is sparse. A sparse matrix
is computationally handled in the present invention by only keeping track of
where the 1's are, and imputing the positions of the [...]

In the meantime, an empirically obtained database of the potency (for some
chemical or pharmacological reaction of interest) [...] mixtures being
examined has been assembled. Taking the data consisting of the assembled 1's
and 0's in the matrix and the known potency for each compound, the task is to
divide the data into two groups, with data objects with 1's assigned to one
group and data objects with 0's assigned to the other, thus effectively
splitting the data into less active and more active compounds.
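The sparse storage and the two-group division can be sketched as follows. This is a minimal Python illustration that assumes the matrix is held as a mapping from column index to the set of rows containing a 1; the function names and the toy data are hypothetical.

```python
def to_sparse_columns(bit_vectors):
    """Keep only the positions of the 1's: column index -> set of row indices."""
    columns = {}
    for row, vec in enumerate(bit_vectors):
        for col, bit in enumerate(vec):
            if bit:
                columns.setdefault(col, set()).add(row)
    return columns


def split_on_column(columns, potencies, col):
    """Divide potencies into (descriptor present, descriptor absent) groups."""
    present = columns.get(col, set())
    has = [potencies[r] for r in sorted(present)]
    lacks = [p for r, p in enumerate(potencies) if r not in present]
    return has, lacks


# Toy data: four compounds, three descriptors, potency scores from 0 to 3.
bit_vectors = [[1, 0, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
potencies = [3, 0, 2, 0]

columns = to_sparse_columns(bit_vectors)
has, lacks = split_on_column(columns, potencies, 0)
print(columns, has, lacks)
```

Column 1 never appears in the mapping because no compound has that descriptor, so an all-zero column costs no storage at all.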

The best column to use to divide the data set must be found. This optimal
column is found through the use of the tool known as recursive partitioning
(RP). RP analysis generates a diagram as exemplified in Figure 5. In the
diagram in Figure 5, the node at the top of the tree is designated as Node 0. It
represents a population or set of 1650 compounds, some of which are active,
but many of which are inactive, whose potency was previously determined
(active compounds are assigned a score of 1, 2 or 3, while inactive compounds
are assigned a score of 0), and which as a group has an average
potency of 0.34. In general, the number of screened compounds needed to build
an analysis tree of this type is at least 100, with 200 or more being
preferred and 1,000 or more being most preferred. Immediately under Node 0 is
a description of an atom triple, C(1,2)-8-; C(2,1)-6-; and C(1,0)-5-. The RP
algorithm examines the difference in potency between groups where each triple
(or any other descriptor) is present or absent. The RP algorithm has identified
this triple as being the best atom triple to partition off active compounds from
inactive compounds in the group of 1650, since this triple produces the
largest difference in average potency between all possible presence/absence
pairs, that is, the difference with the smallest p-value using a statistical
test. The algorithm has here split off 37 compounds having this triple, and 37
is the number that appears in the next lower node to the right of Node 0 (all
compounds not having this triple are split off to the left). These 37 compounds
have an average potency of 2.8, out of a maximum possible of 3. Thus, the
algorithm has already identified an atom triple that is a chemical structural
[...] next best atom triple to partition off active compounds from inactive
compounds in the remaining group of 37 [...] two compounds lacking the triple
being split off to the left, and the remaining 35 compounds being split off to
the right. The two compounds split off to the left have no activity, while the
other 35 compounds have an average activity of 2.94 out of a possible 3, as
stated in the lowermost right side node (called a terminal node). Now a
structure-activity relationship is seen in which the presence of the two
defined triples reflects a high degree of average potency in the compound
subgroup. A typical molecular structure bearing these two atom triples is
given, and it can be said with relative confidence that molecules having this
general structural core will be active in the screen of interest here (atoms
marked with circles are those that belong to the defining atom triples for
that node).

However, it can be seen that two other good terminal nodes showed up in this
analysis, resulting in three chemical structure classes being generated in Figure
5. When the first round of partitioning took place, the algorithm took the
remainder of 1613 compounds and identified an atom triple tending to confer
activity within that group, C(3,0)-2-; N(1,2)-2-; N(1,2)-3-, and partitioned that
subgroup accordingly into two subgroups having average potencies of 2.3 and
0.23, reflecting the presence or absence of that atom triple. The partitioning
process continued until terminal nodes were reached, yielding three
structure-activity relationships. These three structural cores can be seen to
have somewhat different chemistries. Thus the [...]
may be the result of different biochemical/chemical mechanisms. RP can deal
with such mixtures of compounds that follow different mechanistic paths.
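The partitioning loop described above can be sketched in Python as follows. This is a simplified illustration and not the SCAM program itself: it assumes each compound is given as a set of descriptor labels, it uses a pooled two-sample t-statistic to rank candidate splits, and it stops on a fixed threshold on |t| (`t_min`) and a minimum group size (`min_size`) rather than on a p-value. All names, thresholds and data are hypothetical.

```python
import math


def t_statistic(x, y):
    """Pooled two-sample t-statistic for the difference in group means."""
    m, n = len(x), len(y)
    if m == 0 or n == 0 or m + n < 3:
        return 0.0
    mean_x, mean_y = sum(x) / m, sum(y) / n
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    if ss == 0:
        return math.inf if mean_x != mean_y else 0.0
    return (mean_x - mean_y) / math.sqrt(ss / (m + n - 2) * (1 / m + 1 / n))


def build_tree(rows, features, potency, t_min=2.0, min_size=2):
    """Recursively split rows (non-empty list of indices) on the best binary descriptor."""
    node = {"n": len(rows), "mean": sum(potency[r] for r in rows) / len(rows)}
    candidates = set().union(*(features[r] for r in rows))
    best = None
    for d in sorted(candidates):
        has = [r for r in rows if d in features[r]]
        lacks = [r for r in rows if d not in features[r]]
        if len(has) < min_size or len(lacks) < min_size:
            continue
        t = abs(t_statistic([potency[r] for r in has],
                            [potency[r] for r in lacks]))
        if best is None or t > best[0]:
            best = (t, d, has, lacks)
    if best is not None and best[0] >= t_min:
        _, d, has, lacks = best
        node["split"] = d
        node["present"] = build_tree(has, features, potency, t_min, min_size)
        node["absent"] = build_tree(lacks, features, potency, t_min, min_size)
    return node


# Toy data: eight compounds described by hypothetical triples T1 and T2.
features = [{"T1", "T2"}, {"T1", "T2"}, {"T1"}, {"T1"},
            {"T2"}, set(), set(), set()]
potency = [3, 3, 2, 3, 0, 0, 1, 0]

tree = build_tree(list(range(len(features))), features, potency)
print(tree["split"], tree["present"]["mean"], tree["absent"]["mean"])
```

In this toy data set the root is split on T1, and neither daughter node is split further because no remaining descriptor passes the size and |t| thresholds.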


Having developed such a tree, it is then possible to predict the activities of
compounds that have not yet been empirically tested for activity. A given
compound is analyzed for presence or absence of triples, or whatever the
descriptor is that has been chosen, and then cascaded down the tree with the
[...] the compound is electronically predicted, eliminating the need for high
throughput screening of large numbers of compounds which [...] activity.
Those compounds having the greatest predicted activity are selectively
tested, at great cost and time savings.
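Cascading an untested compound down a finished tree can be sketched as below. The nested-dictionary tree format and the labels `triple_1` to `triple_3` are assumptions for illustration. The leaf potencies are the averages quoted for Figure 5, but the tree is simplified: the 2.3 and 0.23 subgroups are treated here as terminal nodes, which the text does not state.

```python
def predict(node, compound_features):
    """Follow present/absent branches until a terminal node; return its mean potency."""
    while "split" in node:
        branch = "present" if node["split"] in compound_features else "absent"
        node = node[branch]
    return node["mean"]


# Simplified stand-in for the tree of Figure 5 (structure assumed, see lead-in).
tree = {
    "split": "triple_1",
    "present": {
        "split": "triple_2",
        "present": {"mean": 2.94},
        "absent": {"mean": 0.0},
    },
    "absent": {
        "split": "triple_3",
        "present": {"mean": 2.3},
        "absent": {"mean": 0.23},
    },
}

print(predict(tree, {"triple_1", "triple_2"}))
print(predict(tree, {"triple_3"}))
print(predict(tree, set()))
```

Only compounds that land in a high-potency terminal node would then be selected for actual testing.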

It is important to understand that not only discrete compounds or individuals
can be assigned to or passed through nodes in the analysis tree, but mixtures
themselves as well. Thus, a situation in which there are 1,000 pools, each
containing 10 different compounds, isomers, conformers, etc., can be analyzed,
with each pool defined and analyzed in terms of the descriptors present in it.
Broadly speaking, discrete compounds or individuals are data objects (an
object that itself is not a mixture), but such pools are themselves also each a
data object, which we refer to as a mixture object for greater clarity (i.e. an
object that is itself a mixture). Whether an object is a data object or a mixture
object, the object is analyzed in the same fashion using bit string assembly and
recursive partitioning.


Situations commonly arise in which multiple binding modes exist by which
several given compounds may be showing the same biological potency, but are
doing so by binding to different available binding sites on a receptor molecule,
a common situation in pharmacology. A related problem is that of a cell that
presents more than one receptor site such that structurally differing molecules
can elicit the same biological response from the cell. These problems are
increased by orders of magnitude when combinatorial [...] The
problem here is in figuring out what different structural features out of such a
mix can confer activity and applying that knowledge to the design or screening
of new compounds. The present invention can resolve such mixture problems
by assembling a set of descriptors that can define a population of compounds
and then proceeding with the rest of the analysis as described to arrive at
structure-activity relationship rules out of the mixture.

[...] present in a pool to be analyzed. The method of the present invention
can be used to find such [...] in a pool and quantify their relative activity
[...] above, not only discrete compounds can be analyzed as data objects but
also mixtures as mixture objects. Thus, where no individual compounds (objects)
decode into a node, but one or more pairs of compounds (mixture objects)
decode into the same node that shows a high average potency, then this result
implies the discovery of a synergistic pair of compounds, with members of
the pair having the characteristics of the descriptors leading to that node.
Synergistic triples, etc., of compounds can be found in like manner.

In genetics, it is common for a population to have individuals in it that are
of different genotypes. It is now known that a great many diseases are controlled
by not one, but multiple genes in an individual. These two factors present a
huge problem in unraveling how to rationally target a drug therapy at a
population of patients who may have the same clinical diagnosis, but whose
pathology is being controlled by multiple possibly different genes within each
patient. Until now, there has been no known satisfactory method for the
identification of multiple interacting genes from large genomic data sets.
However, the present invention addresses this by using alleles or combinations
of alleles and/or gene markers as descriptors. Thus, as shown in Figure 6, a
patient population of 1293 individuals had an average disease incidence of
0.61. The RP algorithm selects the gene marker aaxxx, present with two
copies, to do a partition. This results in a subgroup of 86 individuals being split
off to the right, 83% of whom had disease, while a subgroup of [...] that
genetic marker is split off to the left, having a disease incidence
of 59%. The analysis is continued until terminal nodes are reached that lead to
the prediction that the highest incidence of disease will occur in those
individuals having two copies of the aaxxx gene but who do not have the gene
dbbfyy, which thus appears to be linked to a protector gene that tends to confer
protection from disease on an individual, since those that had the putative
protector gene only had a 30% incidence of disease. Using these results [...]
one or more of the [...] markers used as descriptors or a nearby gene.

Since the economics of high throughput screening favor screening mixtures of
compounds, the questions then arise of how to analyze such pooled data, and
how to pool them. In another preferred embodiment of the invention, RP can be
used to analyze such pooled data.

Discrete products of a combinatorial synthesis can be encoded and decoded by
use of the present invention, since each vector as described above is an
identifier of the features of a compound. A given compound from a
combinatorial synthesis (especially a virtual synthesis, see US Patent No.
5,463,564) is electronically dropped down an analysis tree and, if it lands in a
given terminal node showing high activity, it is now known to have a high
probability of activity by virtue of all descriptors assigned to each node through
which it passed successfully. This eliminates screening and identification of the
great majority of compounds in a virtual combinatorial library, as it is well
known that the great majority of combinatorial discretes are chemical 'junk' that
will not have any appreciable biological activity, but still have to be winnowed
out of a combinatorial pool, currently at great wasted expense.

SCAM was the software tool developed as part of the present invention to
perform recursive partitioning by swiftly computing binary splits on a large
number of descriptor variables. There are several aspects of [...]
Huge sparse matrices, tens of thousands of structures and millions of
descriptors have to be handled, efficient binary splits on up to a million or more
variables have to be routinely performed, and a useful bridge for the chemist
between the statistical analysis and the actual structures has to be devised.

Three files are produced prior to a SCAM analysis: (1) a data file containing
the compound names and potencies; (2) a descriptor dictionary [...]
For each descriptor, a list of the structures in which the descriptor is
found is stored. This is very similar to the concept of [...] An
alternative is to store a list of descriptors that are found in each structure.
However, the former is more efficient, since the t-test is performed on the
activities of the structures associated with a particular descriptor.

In contrast to data partitioning via continuous descriptor variables, binary
classification trees can be computed very quickly and efficiently since there are
far fewer and much simpler computations involved. For example, FIRM
develops rules for splitting based on "binning" of continuous variables and
amalgamating contiguous groups of variables. These processes add
considerably to execution time and effectively limit the interactive nature of
most general RP packages for large data sets. However, with binary data a
parent node can be split into two and only two daughter nodes. Splitting
on a binary descriptor such as the presence or absence of an atom pair involves
performing a t-test between the mean of the group that has the atom pair and
the mean of the group that does not. The t-values for each rule as a potential
split can then be compared. The atom pair with the largest t-statistic is the
splitting variable. Therefore, the p-value (a time-consuming part
of the calculation) needs only to be computed for the most significant split.
Adding to the speed is the fact that, frequently, either the group that has the
atom pair or the group that does not have the atom pair is quite small.
This fact can be exploited using an idea known as "updating", which can be
applied to a well-known expression for computing the sample [...]

If the potencies in group 1 are denoted by $x_1, x_2, \ldots, x_m$ and those in
group 2 by $y_1, y_2, \ldots, y_n$, assuming that group 1 is smaller than
group 2 ($m < n$), the t-statistic for testing for a difference between group
potency means is:

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{SSX + SSY}{m + n - 2}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$

where

$$SSX = \sum_{i=1}^{m} (x_i - \bar{x})^2, \qquad \bar{x} = SX/m, \qquad SX = \sum_{i=1}^{m} x_i,$$

$$SSY = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad \bar{y} = SY/n, \qquad SY = \sum_{i=1}^{n} y_i.$$

Let $z_1, z_2, \ldots, z_{m+n}$ denote the potencies in the parent node. The
sum, SZ, was computed for the previous split so it is available. After
computing SX, SY can be computed as the difference SY = SZ - SX. This
technique is known as "updating".

A similar updating method can be used to compute SSX and SSY. Note that:

$$SSY = \sum_{i=1}^{n} y_i^2 - \frac{SY^2}{n}$$

so SSY can be computed using the sum of the data, SY, and the sum of the
squared data, which will be denoted by SYY. Having computed SXX, and having
SZZ available, SYY can be computed by the relation SYY = SZZ - SXX. Therefore,
the t-statistic can be computed very quickly, having stored the sum of the data
and the sum of the squared data from the previous split.
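The updating idea can be sketched in Python as follows. This is an illustrative sketch, assuming the pooled two-sample form of the t-statistic; the function name and the toy potencies are hypothetical. Only the smaller group is summed directly, and the sums for the larger group come from subtraction against the stored parent-node sums SZ and SZZ.

```python
import math


def t_with_updating(x, SZ, SZZ, N):
    """t-statistic for group x versus the rest of a parent node of N objects.

    x   : potencies of the smaller group (the only values summed directly)
    SZ  : stored sum of all potencies in the parent node
    SZZ : stored sum of squared potencies in the parent node
    """
    m = len(x)
    n = N - m
    SX = sum(x)
    SXX = sum(v * v for v in x)
    SY = SZ - SX            # updating: no pass over the larger group
    SYY = SZZ - SXX
    SSX = SXX - SX * SX / m
    SSY = SYY - SY * SY / n
    pooled = (SSX + SSY) / (m + n - 2)
    return (SX / m - SY / n) / math.sqrt(pooled * (1 / m + 1 / n))


# Toy parent node of eight potencies; the first four have the descriptor.
z = [3, 3, 2, 3, 0, 0, 1, 0]
SZ = sum(z)
SZZ = sum(v * v for v in z)

t = t_with_updating(z[:4], SZ, SZZ, len(z))
print(t)
```

The cost of evaluating a candidate split is therefore proportional to the size of the smaller group, which is what makes scanning a very large number of sparse descriptors practical.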


The partitioning is repeated until a stopping criterion is met. Firstly, the
process can stop if there is no statistical test (a t-test is preferred) that
achieves a specified level of statistical significance. Secondly, the process
can stop if the mixtures in a node are homogeneous with respect to their
measured property. Thirdly, the process can stop if the size of each terminal
node is below a user-specified value.



Example Analysis:
Use of RP to uncover substructural rules that govern the biological activity of
a set of [...]



A series of 1,650 MAOI's was used to illustrate the effectiveness of SCAM in
analyzing large structure-activity datasets and producing SAR rules. Neuronal
monoamine oxidase [amine:oxygen oxidoreductase (deaminating), E.C. 1.4.3.4]
inactivates neurotransmitters such as norepinephrine by converting the amino
group to an aldehyde. Inhibitors of this enzyme are thought to be useful in the
treatment of [...] pharmaceutical researchers of MAO as a target for rational
drug design in anti-depressant therapy. Biological activities were reported
[...] MAOI's: 0 being inactive; 1, somewhat active; 2, modestly active; and 3,
most active. Generating any type of QSAR from this dataset would previously
have been considered by those of skill in the art to be quite difficult,
but use of the present invention in statistically determining SAR rules is now
possible and relatively easy.

Recursive partitioning was applied to this set of 1,650 activities and unique atom pairs, and the resulting tree diagram is shown in Figure 1. Default settings were used to produce this tree: up to 10 levels of partitioning are allowed, each split is statistically significant (Bonferroni-adjusted p-value < 0.01), and both positive and negative splits are allowed. The Bonferroni p-value is computed by multiplying the raw p-value by the number of variables examined at the node. Eleven significant splits were found, although a high percentage, 79.5% (70/88), of the most active molecules are found in only 3 terminal nodes (shaded in gray).
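The Bonferroni adjustment described above amounts to a single multiplication. The following Python sketch shows it; the function names are hypothetical, and capping the adjusted value at 1 is a common convention assumed here rather than something stated in the text.

```python
def bonferroni_adjusted(raw_p, n_variables):
    """Multiply the raw p-value by the number of variables examined at the node.

    The cap at 1.0 is an assumption (a p-value cannot exceed 1).
    """
    return min(1.0, raw_p * n_variables)

def is_significant(raw_p, n_variables, alpha=0.01):
    """True if the adjusted p-value is below the threshold (0.01 by default)."""
    return bonferroni_adjusted(raw_p, n_variables) < alpha
```

For example, a raw p-value of 1e-6 obtained while examining 5,000 descriptors adjusts to about 0.005 and would pass the default 0.01 threshold.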

To facilitate the understanding of the splits of the data obtained from recursive partitioning, it was necessary to have a molecular viewer which could not only display molecules, but also highlight the portions of the molecules described in the rules. SCAM is not locked into displaying only one type of descriptor, but rather passes the descriptor variables on the path to a node to an external program, which highlights the appropriate atoms or bonds and then passes the structure along to a viewer. To SCAM, descriptors are just strings, and it is up to the external programs to interpret and display them. The external programs can be specified by simply setting external environment variables.

SCAM has an option that allows the user to enter an MDL SD-file containing the structures for the compounds. Rather than reading them directly into memory, as the files can be quite large, a list of seek indices is computed once on the SD-file. Then, whenever the user requests to see the compounds at a node, it is a simple matter of [...]
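A seek index of this kind can be sketched in a few lines of Python. The sketch assumes only that each SD-file record ends with a line containing `$$$$` (the standard record delimiter); the function names are hypothetical and this is not SCAM's own code.

```python
def build_seek_index(path):
    """Scan an SD file once and return the byte offset where each record starts."""
    offsets = []
    with open(path, "rb") as fh:
        start = fh.tell()
        while True:
            line = fh.readline()
            if not line:
                break
            if line.strip() == b"$$$$":   # end of the current record
                offsets.append(start)
                start = fh.tell()
    return offsets

def read_record(path, offsets, i):
    """Jump straight to record i and return its text, delimiter included."""
    lines = []
    with open(path, "rb") as fh:
        fh.seek(offsets[i])
        while True:
            line = fh.readline()
            if not line:
                break
            lines.append(line)
            if line.strip() == b"$$$$":
                break
    return b"".join(lines).decode("ascii", errors="replace")
```

The index holds one integer per compound, so it stays small even when the SD file itself is too large to keep in memory.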



When examining the RP classification tree, it is often of great interest to see the distribution of potencies at a node and to see how a split at a node [...] available to display the potency distribution at the node, with the potency distribution of the two daughter nodes overlaid in different colors. The density plot is performed by weighting each point by a Gaussian kernel function with a configurable bandwidth. If the assay variability is known, then the assay standard deviation can be used for the bandwidth.
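The Gaussian-kernel density described above can be sketched as follows. This is a plain kernel density estimate written for illustration; the function name and the evaluation grid are assumptions, and the bandwidth would be set to the assay standard deviation when that is known.

```python
import math

def gaussian_density(potencies, grid, bandwidth):
    """Kernel density estimate: every potency contributes one Gaussian bump."""
    norm = 1.0 / (len(potencies) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in potencies)
        for g in grid
    ]
```

Calling the function once for the parent node and once for each daughter node, on the same grid, gives the three curves that would be overlaid in the plot.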

AT tree 

Once the analysis has been completed, a file describing the rules that create an RP tree can be written to disk, and a utility program, Pachinko, can be invoked on a new dataset to find where the compounds in that dataset would fall in the classification tree. Thus, a set of compounds can be screened and analyzed with SCAM to produce a classification tree, and then a whole corporate chemical compound collection, or even virtual chemical compound libraries, can be dropped down the tree to suggest additional compounds for biological screening. With Pachinko it is also possible to divide data into training and validation datasets to test the predictive power of the tree.

With a large number of descriptor variables, it is often the case that more than one descriptor would give rise to the same split at a node. These variables are considered to be perfectly correlated. When the variable associated with the most significant split has other perfectly correlated variables, all such descriptors at the node are stored so that they are available as input to the Pachinko program. In the dataset used to create the tree, all correlated variables will be found within the structures at a right node, though, in theory, only one would be necessary in order for some novel structure to be placed there. Within the Pachinko program, there is an option either to force all correlated variables to match for a rule to be satisfied, or else to have any one matching descriptor suffice for the right path in the tree to be taken.
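The two matching options can be expressed compactly. The sketch below is illustrative only: the function name and the descriptor strings are hypothetical, and descriptors are treated as opaque strings, as the text describes.

```python
def rule_satisfied(object_descriptors, node_descriptors, require_all=True):
    """Decide whether an object takes the right path at a node.

    object_descriptors  set of descriptor strings present in the object
    node_descriptors    the split descriptor plus its perfectly correlated ones
    require_all         True: every descriptor must match; False: any one suffices
    """
    present = [d in object_descriptors for d in node_descriptors]
    return all(present) if require_all else any(present)
```

With `require_all=False` a novel structure carrying only one of the correlated descriptors is still sent down the right branch, which is the more permissive of the two behaviours.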

1. File Menu

File commands are used to import the data files associated with SCAM, enter documentation, and send print output to a file.

1.1 Import

read the .dat file and store compound names and potencies in arrays;
read the .des file and store descriptor codes and names in arrays;
read the .bit file and create a matrix which has a row for each descriptor and, in each row, an array of indices (into the compounds array) of all compounds that have that descriptor;

1.2 Read Structures

calculate a set of seek indices into an SD file so that molecular structure information can be accessed quickly;

1.3 Edit Information Box

allow the user to input information about the data set being analyzed;

1.4 Print Tree

write the current tree to a PostScript file for later printing;

quit SCAM;
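The matrix built by the Import command, one row per descriptor holding the indices of the compounds that have it, can be sketched as follows. The layout of the .bit file is not given in the text, so the input here (one string of '0' and '1' characters per compound) is an assumption, as is the function name.

```python
def build_descriptor_rows(bit_strings):
    """Turn per-compound bit strings into per-descriptor lists of compound indices.

    bit_strings[i] is the bit string of compound i (assumed format, e.g. "0110").
    Returns rows such that rows[d] lists every compound index that has descriptor d.
    """
    n_desc = len(bit_strings[0]) if bit_strings else 0
    rows = [[] for _ in range(n_desc)]
    for compound_index, bits in enumerate(bit_strings):
        for d, bit in enumerate(bits):
            if bit == "1":
                rows[d].append(compound_index)
    return rows
```

Because most descriptors occur in few compounds, storing only the indices of the 1 bits keeps the structure sparse.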



2. Tree Menu

Most of the options in the tree menu operate on the currently active node, which the user indicates by positioning the cursor over a node and clicking the left mouse button.



bonferroni := number of descriptors in the data set;
tbest := 0; /* holds the t-statistic for the best split */

for every descriptor in the data set do
{
    split the compounds in the active node into two groups
    according to whether or not they have the descriptor;

    if the descriptor appears in none or all of the compounds then
        bonferroni := bonferroni - 1;
    else
    {
        calculate the t-statistic for this split:

            t = (X_left - X_right) / (σ · sqrt(1/n_left + 1/n_right))

        where:
            X = mean potency of compounds in left or right child
            σ = standard deviation of compound potencies of node being split
            n = number of compounds in left or right child

        if t > tbest then tbest := t; /* largest t-statistic indicates the most significant split */
    }
}

compute the p-value from tbest and multiply this by the bonferroni adjustment to get a value indicating the significance of the split;


2.2 Delete Subtree

[...]

while (tree depth from the currently active node < maximum depth AND further splits can be found) do
    split a terminal node of the tree rooted at the currently active node

2.4 View Structures

filter an SD file containing the compounds in the active node through an external program which highlights the atoms in the compounds that correspond to the descriptor variables (including correlated ones) that got the compound to that node;

send the filtered SD file to a viewer program (Project View);

2.5 Structures Clipboard

copy the structures at the active node to the clipboard in the form of an SD file;

2.6 Save Structures

write all structures (with atom highlighting; see Section 2.4) within the active node to an [...] file;

2.7 List Node

write a list of the compounds and potencies within the active node to an external file;

2.8 Node Potency Histogram

draw a non-parametric density plot of the potencies of [...]

2.9 Write Pachinko Subtree Rules

write the rules that generated the tree rooted at the active node to an external file;

2.10 Create .dat File for Node

create a .dat file for the compounds in the active node;

[...] how nodes are split and how the tree is displayed;

Copyright 1997 by Glaxo Wellcome, Inc., all rights reserved, except as stated above.

There is now set forth a pseudocode example for carrying out the function of prediction of activity of a molecule by Pachinko if rules from SCAM/Recursive Partitioning have been previously stored.

For each rule used to split data
    input Node Tree Position;
    input Node Average;
    input Node Number Rules;
    input Node Rule Set;

For each object to be predicted
    Current Tree Position := "N";
    Object Activity := Node Average at Current Tree Position;
    input Object Name;
    input Object Rule Set;

    While Node Number Rules at Current Tree Position is greater than 0
        for every rule r_i in Node Rule Set at Current Tree Position
            [...]
            Current Tree Position := Current Tree Position [...]
        next Rule Set;
        Current Tree Position := Current Tree Position [...] "1";

    Object Activity := Node Average at Current Tree Position;
    print Object Name, Object Activity;

Copyright 1997, 1998 by Glaxo Wellcome, Inc., all rights reserved except as stated above.
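The prediction loop can be sketched in Python. The encoding of tree positions is an assumption made for illustration: the root is "N", and "1" or "0" is appended when the node's rule is or is not satisfied. The dictionary layout and function name are likewise hypothetical, and parts of the original pseudocode are illegible, so this is not a transcription of it.

```python
def predict(tree, object_descriptors, require_all=False):
    """Drop one object down a stored tree and return (terminal position, activity).

    tree maps a position string to (node_average, rule_set); an empty rule_set
    marks a terminal node. object_descriptors is the set of descriptor strings
    present in the object.
    """
    position = "N"
    while tree[position][1]:
        rules = tree[position][1]
        hits = [r in object_descriptors for r in rules]
        matched = all(hits) if require_all else any(hits)
        position += "1" if matched else "0"
    return position, tree[position][0]
```

The predicted activity is simply the average potency stored for the terminal node in which the object comes to rest.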
What is claimed is:

1. A computer-based method of encoding features of data objects, and of identifying and correlating individual said features to a response characteristic that is a trait of interest of the data object, applicable to data objects in a data set that is characterized in being a mixture of data object classes, each data object class containing one or more of said data objects, and wherein multiple data objects present a same or similar value of the trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms,

comprising the steps of:

(a) assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of any given potentially useful feature of interest in a data object of interest;

(b) examining each data object for presence or absence of each of said descriptors;

(c) assembling the results of step (b) into a vector for each data object, noting the presence or absence of each feature of interest in said data object;

(d) assembling all vectors generated in step (c) into a matrix with each row of the matrix corresponding to a data object and each column corresponding to a feature of interest;

(e) dividing the data in said matrix into two daughter [...]

(f) repeating step (e) until each member of said matrix has been identified in terms of presence or absence of any given feature of interest from said set of descriptors and each of said members has been assigned to a terminal node.
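Steps (a) through (d) of the claim, encoding each object as a bit string and stacking the vectors into a matrix, can be illustrated with a small Python sketch. The function name, the dictionary input and the example descriptor strings are all hypothetical.

```python
def encode_objects(objects, descriptors):
    """Build one 0/1 vector per data object and stack them into a matrix.

    objects      maps an object name to the set of features it contains
    descriptors  ordered list of descriptor strings (one matrix column each)
    Returns (names, matrix) where matrix[i][j] is 1 if object i has descriptor j.
    """
    names = sorted(objects)
    matrix = [[1 if d in objects[name] else 0 for d in descriptors] for name in names]
    return names, matrix

def bit_string(row):
    """Render one row of the matrix as a bit string such as "101"."""
    return "".join(str(bit) for bit in row)
```

Each row is the vector of step (c) and the list of rows is the matrix of step (d); recursive partitioning then operates on the columns.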

2. A computer-based apparatus system for allowing a user thereof to encode features of data objects, and to identify and correlate individual said features to a response characteristic that is a trait of interest of the data object, applicable to data objects in a data set that is characterized in being a mixture of data object classes, each data object class containing one or more of said data objects, and wherein multiple data objects present a same or similar trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms, comprising:

(a) input means responsive to operator commands enabling an operator to specify a set of descriptors that are subsequently converted into a bit-string, such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a data object of interest;

(b) storage means for storing the assembled set of (a);

(c) memory means for executing programmed steps that examine each data object for presence or absence of each of said descriptors;

(d) means for assembling the results of (c) into a virtual matrix with each row of the matrix corresponding to an object and each column corresponding to a feature of interest;

(e) means for assigning each data object in said matrix recursively into one of two defined categories on the basis of presence or [...] descriptors and assigned to a terminal node; and

(f) output means for visually displaying, using computer graphics, a relationship of said descriptors with said data objects and classes.

3. A computer software system having a set of instructions for controlling a general purpose digital computer in performing a desired function comprising:

a set of instructions formed into each of a plurality of modules, each module comprising:

(a) an input process responsive to operator commands enabling an operator to specify a set of descriptors and convert said descriptors into a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest of a data object of interest, wherein each data object is a member of a data set that is characterized in being a mixture of data object classes, each data object class containing one or more of said data objects, and wherein multiple data objects present a same or similar trait of interest, but classes of data objects produce the response characteristic that is a trait of interest through different underlying mechanisms;

(b) a data storage process for storing the assembled set of (a);

(c) a computational process for executing programmed steps that examine each member of said mixture for presence or absence of each of said descriptors;

(d) a computational process for assembling the results of (c) into a vector for each data object and a matrix for all vectors;

(e) [...] presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each member of said mixture [...] absence of each feature of interest from said set of descriptors and assigned to a terminal node;

(f) a data storage process; and

(g) an output process for visually displaying, using computer graphics, a relationship of said descriptors with said data objects and classes.

4. A computer-based method of encoding mixture features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and of identifying and correlating individual said features to a response characteristic of the mixture object, wherein said mixture object is in a data set wherein multiple mixture objects comprising the data set present the same trait of interest through a common underlying mechanism;

comprising the steps of:

(a) assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object;

(b) examining each mixture object for presence or absence of each of said descriptors;

(c) assembling the results of step (b) into a vector for each mixture object, noting the presence or absence of each feature of interest in said mixture object;

(d) assembling all vectors generated in step (c) into a matrix with each row corresponding to a mixture object and each column corresponding to [...]

(e) [...] into two defined daughter nodes on the basis of presence or absence of a given feature of interest from said set of descriptors; and

(f) repeating step (e) until each mixture object of said matrix has been identified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node.

5. A computer-based apparatus system for allowing a user thereof to encode features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and to identify and correlate individual said features to a response characteristic of the mixture object, wherein said mixture object is in a data set wherein multiple mixture objects comprising the data set present the same trait of interest through a common underlying mechanism, comprising:

(a) input means responsive to operator commands enabling an operator to specify a set of descriptors that are subsequently converted into a bit string, such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest;

(b) storage means for storing the assembled set of (a);

(d) means for assembling the results of (c) into a virtual matrix with each row corresponding to a mixture object and each column corresponding to a feature;

(e) means for assigning each mixture object in said matrix recursively into one of two defined categories on the basis of presence or absence [...] from said set of descriptors and assigned to a terminal node; and

(f) output means for visually displaying, using computer graphics, the relationships of said descriptors with said mixture classes and mixture objects.

6. A computer software system having a set of instructions for controlling a general purpose digital computer in performing a desired function comprising:

a set of instructions formed into each of a plurality of modules, each module comprising:

(a) an input process responsive to operator commands enabling an operator to specify a set of descriptors and convert said descriptors into a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest, wherein each mixture object is a member of a data set where each mixture object presents a same trait of interest through a common underlying mechanism;

(b) a data storage process for storing the assembled set of (a);

(c) a computational process for executing programmed steps that examine each member object of said data set for presence or absence of each of said descriptors;

(d) a computational process for assembling the results of (c) into a vector for each mixture object and a virtual matrix with each row corresponding to a mixture object and each column corresponding to a feature;

(e) a computational process for analyzing the data in said matrix [...] presence or absence of each feature of interest from said set of descriptors and assigned to a terminal node;

(f) a data storage process; and

(g) an output process for visually displaying, using computer graphics, a relationship of said descriptors with said mixture objects and classes.

7. A computer-based method of encoding mixture features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and of identifying and correlating individual said features to a response characteristic that is a trait of interest of the mixture object, wherein said mixture object is in a data set that is characterized in being a mixture of mixture object classes, each class containing one or more of said mixture objects, and wherein multiple mixture objects present a same trait of interest but classes of mixture objects produce the response characteristic which is a trait of interest through different underlying mechanisms,

comprising the steps of:

(a) assembling a set of descriptors and converting said set of descriptors into the form of a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest;

(b) examining each mixture object for presence or absence of each of said descriptors;

(c) assembling the results of step (b) into a vector for each mixture object, noting the presence or absence of each feature in said data object;

(d) assembling all vectors generated in step (c) [...]

(e) [...] into two defined daughter nodes on the basis of presence or absence of a given feature of interest from said set of descriptors; and

(f) repeating step (e) until each mixture object of said matrix has been identified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node.

8. A computer-based apparatus system for allowing a user thereof to encode features of planned mixtures or of inadvertent mixtures, or of a combination of planned or inadvertent mixtures, and to identify and correlate individual said features to a response characteristic that is a trait of interest of the mixture object, applicable to mixture objects in a data set that is characterized in being a mixture of mixture object classes, each class containing one or more of said mixture objects, and wherein multiple mixture objects present a same trait of interest, but classes of mixture objects produce the response characteristic that is a trait of interest through different underlying mechanisms, comprising:

(a) input means responsive to operator commands enabling an operator to specify a set of descriptors that are subsequently converted into a bit string, such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest;

(c) memory means for executing programmed steps that examine each mixture object for presence or absence of each of said descriptors;

(d) means for assembling the results of (c) into a virtual matrix with each row corresponding to a mixture object and each column [...]

(e) [...] said matrix recursively into one of the defined categories on the basis of presence or absence of a given feature of interest from said [...] each mixture object of said matrix has been classified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node; and

(f) output means for visually displaying, using computer graphics, the relationships of said descriptors with said mixture objects and classes.

9. A computer software system having a set of instructions for controlling a general purpose digital computer in performing a desired function comprising:

a set of instructions formed into each of a plurality of modules, each module comprising:

(a) an input process responsive to operator commands enabling an operator to specify a set of descriptors and convert said descriptors into a bit string such that each descriptor reflects the presence or absence of a potentially useful feature of interest in a mixture object of interest, wherein each mixture object is a member of a data set that is characterized in being a mixture of classes, each class containing one or more of said mixture objects, and wherein multiple mixture objects present the same trait of interest, but classes of mixture objects produce the response characteristic that is a trait of interest through different underlying mechanisms;

(b) a data storage process for storing the assembled set of (a);

(c) a computational process for executing programmed steps that examine each mixture object of said matrix for presence or absence of each of said descriptors;

(d) [...] corresponding to a mixture object and each column corresponding to a feature;

(e) a computational process for assigning each mixture object in said matrix into one of two defined categories on the basis of presence or absence of a given feature of interest from said set of descriptors and repeating such analysis until each member of said matrix has been classified in terms of presence or absence of given features of interest from said set of descriptors and assigned to a terminal node;

(f) a data storage process; and

(g) an output process for visually displaying, using computer graphics, a relationship of said descriptors with said mixture objects and classes.



15 



20 



25 



10. A computer-based method of analyzing biological potency of individual 
chemical structure features out of a plural mixture of chemical compounds 
wherein a created data set is characterized in being a mixture of data objects, 
each data object itself being a mixture of active and/or inactive chemical 
compounds, which active chemical compounds exhibit a trait of interest, 
w herein the underlying mechanisms of activity may be through a single or 
multiple mechanisms, comprising the steps of: 



{ 'd) assembling a set of descriptors such that each descriptor captures a 
chemically useful feature of one or more members of a mixture of chemical 
compounds such that one member is captured if individual chemical 
compounds are being decoded, two members are captures if pairs of chemical 
compounds are being decoded, three members arc captured if triples of 
chemical compounds are being decoded and so on; 



WO 98/47087 PC77US98/07899 

-12- 



(c) assembling the results of step (b) into a descriptor vector; 

(d) comparing the features of the individual compound, pair, triple and so 
forth, to the features of a terminal node of choice and determining a resident 

5 terminal node; 

(e) repeating step (d) until each compound, pair, triple and so forth of said 
set of mixtures of chemical compounds has been identified and characterized in 
relation to the terminal node it would reside within. 

11. The method as claimed in claim 1, 4, 7 or 10, including the additional step of assembling a chemical structure data file.

12. The method as claimed in claim 1, 4, 7 or 10, including the additional step of assembling biological data pertaining to each chemical mixture or mixture of chemicals and assigning each chemical mixture its biological data.

13. The method as claimed in claim 1, 4, 7 or 10, in which said correlation is between presence or absence of one or more chemical descriptors and biological activity of a chemical mixture.

14. The method as claimed in claim 1, 4, 7 or 10, in which said correlation is between presence or absence of one or more chemical descriptors and pharmacological activity of a chemical compound.

15. The method as claimed in claim 1, 4, 7 or 10, including the additional step of determining structure-activity relationships, such relationships comprising sets of rules defining the sets of features specific to each activity class.

17. The method as claimed in claim 1, in which said descriptor is an atom triple.

18. The method as claimed in claim 17, in which said atom triple is a set of three defined atoms in a molecule of interest, each atom defined by element, by spatial relation to each of the other two atoms, and by the type of chemical bond or number of chemical bonds separating them in the molecule.

19. The method as claimed in claim 1, in which said descriptor is a molecular fragment.

20. The method as claimed in claim 1, in which said descriptor is a molecular topological torsion.

21. The method as claimed in claim 1, in which said descriptor is a measure of thermodynamic stability.

22. The method as claimed in claim 1, in which said descriptor is a binary of continuous variable.

23. The method as claimed in claim 1, in which said descriptor is a combination in any order of an atom pair, an atom triple, a molecular fragment, a molecular topological torsion, thermodynamic stability or a binary of a continuous variable.

24. The method as claimed in claim 1, in which each descriptor is an element of a vector in said matrix.

25. The method as claimed in claim 1, in which [...]

[...] vector is computationally represented as a bit string data file.

27. The method as claimed in claim 26, in which said bit string data file is utilized to computationally create a bit string data file.

28. The method as claimed in claim 26, in which said bit string is computationally compressed into a sparse matrix.

29. The method as claimed in claim 28, in which said sparse matrix is statistically analyzed by recursive partitioning.

30. The method as claimed in claim 29, in which said recursive partitioning is performed by the CART method.

31. The method as claimed in claim 29, in which said recursive partitioning is performed by the FIRM method.

32. The method as claimed in claim 29, in which said recursive partitioning is performed by the C4.5 method.

33. The method as claimed in claim 31, in which said FIRM method is converted from multiway splits to binary splits.



The method as claimed in claim 1, including the additional step of selecting the descriptor that optimally divides said rows of said data matrix into two subsets of rows, being either compounds or mixtures of compounds where said feature of interest is present or absent, respectively, and repeating this process through subsequent iterations until all descriptors in said descriptor set have been examined repeatedly and all said rows assigned to terminal nodes.

[...]

36. The method as claimed in claim 1, in which said data objects are discrete compounds.

37. The method as claimed in claims 4, 7 or 10, in which said data objects are mixtures of discrete compounds.

38. A computer-based method of encoding, decoding and identifying individual chemical compounds out of a chemical mixture, comprising the steps of:

(a) assembling the results of previously conducted screening of the chemical mixture for a biological activity of interest;

(b) assembling a set of descriptors such that each descriptor captures a chemically useful feature of one or more members of a chemical mixture;

(c) examining each combination of members of said chemical mixture for presence or absence of each of said descriptors;

(d) correlating presence or absence of said chemical descriptors with an assigned terminal node, thereby identifying predicted activity; and

(e) analyzing subsequent chemical mixtures for chemical structure, comparing their chemical structure against [...] extrapolating biological reactivity of such subsequent chemical mixtures therefrom.

39. The method as claimed in claim 38, including the additional step of assembling a chemical structure data file.

40. The method as claimed in claim [...], including [...]

41. The method as claimed in claim 38, in which said correlation is between presence or absence of one or more chemical descriptors and biological activity of a chemical compound or mixture.

42. The method as claimed in claim 38, in which said correlation is between presence or absence of one or more chemical descriptors and pharmacological activity of a chemical compound or mixture.

43. The method as claimed in claim 38, in which said descriptor is an atom 
pair.

44. The method as claimed in claim 38, in which said descriptor is an atom 
triple. 

45. The method as claimed in claim 44, in which said atom triple is a set of
three defined atoms in a molecule of interest, each atom defined by element, by 
spatial relation to each of the other two atoms, and by the type of chemical 
bond or number of chemical bonds separating them in the molecule. 
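
As an illustration of the atom pair and atom triple descriptors of claims 43 to 45, the following is a minimal sketch and not the claimed implementation. It assumes a molecule supplied as a hypothetical list of element symbols plus an adjacency list, describes each atom by element only, and uses the number of bonds on the shortest path as the separation between atoms; spatial relations and bond types are left out.

```python
from collections import deque
from itertools import combinations, permutations


def bond_distances(adjacency, start):
    """Breadth-first search: bonds on the shortest path from `start` to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def atom_pairs(elements, adjacency):
    """Return the set of (element, element, bonds apart) descriptors."""
    dist = [bond_distances(adjacency, i) for i in range(len(elements))]
    found = set()
    for i, j in combinations(range(len(elements)), 2):
        if j in dist[i]:  # skip atoms in disconnected fragments
            first, second = sorted((elements[i], elements[j]))
            found.add((first, second, dist[i][j]))
    return found


def atom_triples(elements, adjacency):
    """Return canonical (e1, e2, e3, d12, d23, d13) descriptors."""
    dist = [bond_distances(adjacency, i) for i in range(len(elements))]
    found = set()
    for trio in combinations(range(len(elements)), 3):
        i, j, k = trio
        if j not in dist[i] or k not in dist[i]:
            continue
        # The smallest tuple over all atom orderings is an order-independent key.
        found.add(min(
            (elements[p], elements[q], elements[r],
             dist[p][q], dist[q][r], dist[p][r])
            for p, q, r in permutations(trio)
        ))
    return found


# Toy input (assumption): the heavy atoms of ethanol, C-C-O.
elements = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
pairs = atom_pairs(elements, adjacency)
triples = atom_triples(elements, adjacency)
```

Each distinct tuple can then serve as one descriptor whose presence or absence is recorded for a compound.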

46. The method as claimed in claim 38, in which said descriptor is a
molecular fragment. 

47. The method as claimed in claim 38, in which said descriptor is a
molecular topological torsion.

48. The method as claimed in claim 38, in which said descriptor is a binary
of continuous variables. 

49. The method as claimed in claim 38, in which [illegible]
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50. The method as claimed in claim 38, in which presence or absence of 
each feature of interest is represented as a 1 or a 0, respectively.

51. The method as claimed in claim 38, in which said vector is
computationally represented as a bit string.

52. The method as claimed in claim 38, including the additional step of 
decoding the chemical compounds in said chemical mixture by reference to 
said matrix vectors for the mixture. 

53. The method as claimed in claim 38, in which said recursive partitioning 
is graphically represented as a recursive partitioning analysis tree. 

54. A computer-based method of encoding, identifying and correlating
individual genetic features of a genetic polymorphism out of a plural
populational mixture of individual subjects so as to identify useful diagnoses
and therapies of individuals and in the identification of genes and gene products
useful in defining biological targets of interest, comprising the steps of:

(a) assembling a set of descriptors such that each descriptor captures a
genetically useful feature, allele, alleles, or marker, of one or more members of
a mixture population of individuals [illegible]

(b) examining each member of said population of individuals for presence 
or absence of each of said genetic features;

(c) assembling the results of step (b) into a matrix; 

(d) dividing the data in said matrix into [illegible];

(e) repeating step (d) until each member of said population of individuals
has been identified and characterized in terms of presence or absence of each 
genetic feature; and 

(f) correlating presence or absence of said genetic features with known
phenotypes of each of said mixture population of individuals, thereby deriving 
a relationship between genotype and phenotype, said relationship useful in 
diagnosis and therapy of individuals and also useful for identification of gene 
products, said gene products useful for selecting drug targets or said gene 
products useful for determining the genetic origin of a disease.

55. The method as claimed in claim 54, including the additional step of 
assembling a populational phenotype data file. 

56. The method as claimed in claim 54, in which said descriptor is an
identified allele or marker. 

57. The method as claimed in claim 54, in which said descriptor is absence 
of a given allele or marker. 

58. The method as claimed in claim 54, in which each descriptor is an 
element of a vector in said matrix.

59. The method as claimed in claim 54, in which each individual in said
population is encoded by a vector in said matrix.

60. The method as claimed in claim 54, in which presence or absence of 
each descriptor is represented as a 1 or a 0, respectively. 



[claim number illegible] The method as claimed in claim 54, in which said bit string is utilized to
computationally create a bit [illegible]

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63. The method as claimed in claim 54, in which said bit string is 
computationally compressed as a sparse matrix. 
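
The 1/0 encoding, the bit string, and the sparse compression named in the preceding claims can be sketched as follows. This is only an illustration under stated assumptions: the descriptor names and the individuals are hypothetical, and the sparse form shown (a list of the column indices that hold a 1) is one common choice, not necessarily the representation the claims have in mind.

```python
def to_bit_string(features_present, descriptors):
    """Encode presence (1) or absence (0) of each descriptor, in a fixed order."""
    return "".join("1" if d in features_present else "0" for d in descriptors)


def to_sparse_rows(bit_strings):
    """Keep only the column indices of the 1 bits in each row."""
    return [[col for col, bit in enumerate(bits) if bit == "1"]
            for bits in bit_strings]


def from_sparse_row(columns, width):
    """Rebuild the full bit string from its stored column indices."""
    set_columns = set(columns)
    return "".join("1" if col in set_columns else "0" for col in range(width))


# Hypothetical descriptor set and population, for illustration only.
descriptors = ["allele_A", "allele_B", "marker_7", "marker_9"]
individuals = [{"allele_A", "marker_9"}, {"allele_B"}, set()]

bit_strings = [to_bit_string(person, descriptors) for person in individuals]
sparse = to_sparse_rows(bit_strings)
restored = [from_sparse_row(row, len(descriptors)) for row in sparse]
```

When most descriptors are absent for most individuals, storing only the positions of the 1 bits takes far less memory than the full matrix, and the full rows remain recoverable.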

64. The method as claimed in claim 54, in which said sparse matrix is
statistically analyzed by recursive partitioning. 

65. The method as claimed in claim 54, in which said recursive partitioning 
is performed by the CART method. 

66. The method as claimed in claim 54, in which said recursive partitioning 
is performed by the FIRM method. 

67. The method as claimed in claim 54, in which said recursive partitioning 
is performed by the C4.5 method.

68. The method as claimed in claim 54, in which said FIRM method is 
converted from multiway splits to binary splits. 

69. The method as claimed in claim 54, including the additional step of
selecting the descriptor that correlates most closely with the highest average
incidence of a phenotype of interest of all individuals in [illegible]
have such a descriptor and creating two subsets of individuals where said
descriptor is present or absent, respectively, and repeating this process through
subsequent iterations until all descriptors in said descriptor set have been
examined and analyzed for prevalence in said population.
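
The selection and splitting step described in this claim can be sketched roughly as below. This is a simplified illustration, not the CART, FIRM, or C4.5 procedures named in the other claims: it assumes each individual is a list of 0/1 descriptor values with a numeric phenotype score, picks the descriptor whose carriers show the highest mean score, and stops when a node is pure or no descriptor can split it. All function and variable names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def best_descriptor(rows, responses, candidates):
    """Pick the descriptor whose carriers have the highest mean response."""
    best, best_mean = None, None
    for d in candidates:
        carriers = [y for row, y in zip(rows, responses) if row[d] == 1]
        if not carriers or len(carriers) == len(rows):
            continue  # descriptor does not divide this node into two subsets
        carrier_mean = mean(carriers)
        if best_mean is None or carrier_mean > best_mean:
            best, best_mean = d, carrier_mean
    return best


def partition(rows, responses, candidates):
    """Recursively split a non-empty node into present and absent subsets."""
    leaf = {"n": len(rows), "mean": mean(responses)}
    if len(set(responses)) == 1:
        return leaf  # pure node: nothing left to explain
    d = best_descriptor(rows, responses, candidates)
    if d is None:
        return leaf  # no remaining descriptor separates these individuals
    remaining = [c for c in candidates if c != d]
    branches = {}
    for name, bit in (("present", 1), ("absent", 0)):
        subset = [(row, y) for row, y in zip(rows, responses) if row[d] == bit]
        branches[name] = partition([row for row, _ in subset],
                                   [y for _, y in subset],
                                   remaining)
    return {"descriptor": d, **branches}


# Toy data (assumption): four individuals, two descriptors, 0/1 phenotype.
rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
responses = [1, 1, 0, 0]
tree = partition(rows, responses, candidates=[0, 1])
```

In this toy data the first descriptor separates the phenotype perfectly, so the tree has a single split and two terminal nodes.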

70. The method as claimed in claim 54, including the additional step of
decoding the individuals in said population by reference [illegible]

[illegible] tree.



WO 98/47087 






PCT/US98/07899 



72. The method as claimed in claim 54, in which said statistical test for 
splitting a node is a t-test. 

73. The method as claimed in claim 54, in which said statistical test for 
splitting a node is a chi-square test.
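
For reference, the two statistics named in claims 72 and 73 can be computed as in the sketch below. The choices here are assumptions, since the claims do not specify a variant: the t statistic is the pooled-variance two-sample form comparing phenotype scores of individuals with and without a descriptor, and the chi-square statistic is the Pearson form for a 2x2 table of descriptor presence against phenotype presence, without continuity correction. Converting either statistic to a p-value, and any multiplicity adjustment, is left out.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Pooled-variance two-sample t statistic (each group needs 2+ values)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) \
        / (na + nb - 2)
    # Raises ZeroDivisionError if both groups have zero variance.
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))


def chi_square_2x2(table):
    """Pearson chi-square for [[a, b], [c, d]]; all margins must be non-zero."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Toy inputs (assumptions): scores with and without a descriptor, and counts
# of [phenotype present, phenotype absent] for descriptor present and absent.
t_value = t_statistic([2, 4, 6], [1, 2, 3])
chi_value = chi_square_2x2([[10, 0], [0, 10]])
```

A larger statistic indicates a stronger association between the descriptor and the phenotype, so a candidate split can be ranked or rejected on that basis.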
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